Detect outliered markers, that their physical position on the chromosome, doesnt correlate to their genetic position, as determined by recombination rate.
Characterize these outliered markers (do they deviate from the rest of the markers? in maf, missingnes, hwe values)
OrderMarkers2 map=map4_14_js.txt data=data_f.call useKosambi=1 numMergeIterations=1 sexAveraged=0 outputPhasedData=2 grandparentPhase=1
setwd("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/recom_male_fem/")
map_files = list.files(path ="/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/recom_male_fem/", pattern = "\\mapped$")
# are all markers belonging to one LG locate on the same chromosome?
# make a boxplot showing the chromosome on which all LG markers are located
out_plotChr = list()
for (i in map_files) {
mapped <- read.table(i, header =FALSE, sep ="\t")[,1:4]
names(mapped) = c("Chr","POS","male_position", "female_position")
p_Chr <- ggplot(mapped, aes(x=Chr)) +
geom_bar(stat = "count") +
theme_classic() +
theme(axis.text.x = element_text(size=12, angle = 45,hjust =1))
out_plotChr[[i]] = p_Chr
}
Does the markers’ physical position correlates to its genetic position, as determined by recombination rate?
ggplotly(out_plotSites$order_1.mapped)
#%>% layout(title = list(x=0.01, text = paste0("total size (cM): female = ", sizeFem, "; male: ",sizeMale)))
ggplotly(out_plotSites$order_2.mapped)
ggplotly(out_plotSites$order_3.mapped)
ggplotly(out_plotSites$order_4.mapped)
ggplotly(out_plotSites$order_5.mapped)
ggplotly(out_plotSites$order_6.mapped)
ggplotly(out_plotSites$order_7.mapped)
ggplotly(out_plotSites$order_8.mapped)
ggplotly(out_plotSites$order_9.mapped)
ggplotly(out_plotSites$order_10.mapped)
ggplotly(out_plotSites$order_11.mapped)
ggplotly(out_plotSites$order_12.mapped)
ggplotly(out_plotSites$order_13.mapped)
ggplotly(out_plotSites$order_14.mapped)
OrderMarkers2 mapped=map4_14_js.txt data=data_f.call useKosambi=1 numMergeIterations=100 sexAveraged=0 outputPhasedData=2 grandparentPhase=1 recombination1=0 chromosome=1 ### automatic outliers based on CI
OrderMarkers2 map=map4_14_js.txt data=data_f.call useKosambi=1 numMergeIterations=100 sexAveraged=1 outputPhasedData=2 grandparentPhase=1 recombination2=0 chromosome=1 ### automatic outliers based on CI
#full_join(out_Recom, out_0fem ) %>% full_join(out_0male) %>% distinct() %>% view() # 256 markers
#bind_rows(out_Recom, out_0fem,out_0male) # 576 markers
in the three analyses, the outliered markers are the same 256, in
most LGs.
Next, I will check their:
to see if it is much different from the rest of the markers. if it is, then I should remove them and re-run the linkage mapping
# upload the outlierd markers, detected manually in the plots above
out_recom_manual <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/out_recom_manual.csv")
#upload the variants allele frequency (created via vcftools)
var_freq <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/data/vcf_filter/Q40BIALLDP16HDP40mis.5Chr7/Q40BIALLDP16HDP40mis.5Chr7.frq", delim = "\t", col_names = c("Chr", "POS", "nalleles", "nchr", "a1", "a2"), skip = 1)
#However, this is simply the allele frequencies. To find the minor allele frequency at each site, we need to use a bit of dplyr based code.
# find minor allele frequency
var_freq$maf <- var_freq %>% select(a1, a2) %>% apply(1, function(z) min(z))
var_freq <- mutate(var_freq, site = paste(var_freq$Chr, var_freq$POS))
# Here we used apply on our allele frequencies to return the lowest allele frequency at each variant. We then added these to our dataframe as the variable maf. Next we will plot the distribution.
ggplot(var_freq, aes(maf)) +
geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
theme_light()
summary(var_freq$maf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2007 0.2607 0.3388 0.3411 0.4167 0.5000
out_Recom <- left_join(out_recom_manual, var_freq, by= c("POS","Chr"))# %>% rename("Chr"="Chr.x")
#out_0male <- left_join(out_0male, var_freq, by= c("POS","Chr"))
#out_0fem <- left_join(out_0fem, var_freq, by= c("POS","Chr"))
p <- ggplot(var_freq, aes(y=maf, x=Chr)) + geom_boxplot() +
geom_point(data = out_Recom, aes(y=maf, x=Chr, text=POS),colour="red",alpha = 0.5) +
#geom_jitter(data = out_0male, aes(y=maf, x=Chr, text=POS),colour="green",alpha = 0.5) +
#geom_jitter(data = out_0fem, aes(y=maf, x=Chr, text=POS),colour="blue",alpha = 0.5) +
theme_light() + theme(axis.text.x = element_text(angle = 45)) +
ggtitle("Minor allele frequency per site")
ggplotly(p)
The minor allele freq is high, relative to most of the variants.
# upload the outlierd markers, detected manually in the plots above
out_recom_manual <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/out_recom_manual.csv")
#upload the Hardy-Weinberg Equilibrium test(created via vcftools)
site_hwe <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/data/vcf_filter/Q40BIALLDP16HDP40mis.5Chr7/Q40BIALLDP16HDP40mis.5Chr7.hwe", delim = "\t", col_names = c("Chr","POS", "OBS(HOM1/HET/HOM2)", "E(HOM1/HET/HOM2)", "ChiSq_HWE", "P_HWE", "P_HET_DEFICIT", "P_HET_EXCESS"), skip = 1)
# Reports a p-value for each site from a Hardy-Weinberg Equilibrium test (as defined by Wigginton, Cutler and Abecasis (2005)). The resulting file (with suffix ".hwe") also contains the Observed numbers of Homozygotes and Heterozygotes and the corresponding Expected numbers under HWE.
site_hwe <- site_hwe %>%
#mutate(site = paste(site_hwe$Chr, site_hwe$POS)) %>%
select(Chr, POS, P_HWE) %>%
mutate(logpv = log(P_HWE))
summary(site_hwe$logpv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -116.62 -30.43 -23.52 -24.16 -17.02 0.00
out_Recom <- left_join(out_recom_manual, site_hwe, by= c("POS","Chr"))
#out_0male_join <- left_join(out_0male, site_hwe, by= c("POS","Chr"))
#out_0fem_join <- left_join(out_0fem, site_hwe, by= c("POS","Chr"))
p <- ggplot(site_hwe, aes(y=logpv, x=Chr)) + geom_boxplot() +
geom_point(data = out_Recom, aes(y=logpv, x=Chr, text=POS),colour="red",alpha = 0.5) +
# geom_jitter(data = out_0male_join, aes(y=logpv, x=Chr, text=POS),colour="green",alpha = 0.5) +
# geom_jitter(data = out_0fem_join, aes(y=logpv, x=Chr, text=POS),colour="blue",alpha = 0.5) +
theme_light() + theme(axis.text.x = element_text(angle = 45)) +
ggtitle("LogPval for site deviation from hardy-weinberg equilibrium")
ggplotly(p)
Next up we will look at the proportion of missingness at each variant. This is a measure of how many individuals lack a genotype at a call site.
# upload the outlierd markers, detected manually in the plots above
out_recom_manual <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/out_recom_manual.csv")
#upload the site missingness (created via vcftools)
var_miss <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/data/vcf_filter/Q40BIALLDP16HDP40mis.5Chr7/Q40BIALLDP16HDP40mis.5Chr7.lmiss", delim = "\t",
col_names = c("Chr", "POS", "nchr", "nfiltered", "nmiss", "fmiss"), skip = 1)
summary(var_miss$fmiss)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0583 0.4215 0.4484 0.4456 0.4753 0.5157
#plot all sites and add the outliered positions
out_Recom <- left_join(out_recom_manual, var_miss, by= c("POS","Chr"))
#out_0male <- left_join(out_0male, var_miss, by= c("POS","Chr"))
#out_0fem <- left_join(out_0fem, var_miss, by= c("POS","Chr"))
p <- ggplot(var_miss, aes(y=fmiss, x=Chr)) + geom_boxplot() +
geom_point(data = out_Recom, aes(y=fmiss, x=Chr, text=POS),colour="red",alpha = 0.5) +
#geom_jitter(data = out_0male, aes(y=fmiss, x=Chr, text=POS),colour="green",alpha = 0.5) +
#geom_jitter(data = out_0fem, aes(y=fmiss, x=Chr, text=POS),colour="blue",alpha = 0.5) +
theme_light() + theme(axis.text.x = element_text(angle = 45)) +
ggtitle("% of missingness per site")
ggplotly(p)
the 35,000 variants are already filtered for 0.5 missingness, this is
why the upper limit is 0.5.
its looks like the outlired markers dont have a specifically high
missingness value.
Remove the 34 sites in the out_recom_manual file.
Re-run lepmap.
check again the mapping